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Abstract 

For the estimation of cumulative link models for ordinal data, the bias-reducing 



adjusted score equations in Firth (19931 are obtained, whose solution ensures an es- 
timator with smaller asymptotic bias than the maximum likelihood estimator. Their 
form suggests a parameter-dependent adjustment of the multinomial counts, which, 
in turn suggests the solution of the adjusted score equations through iterated max- 
imum likelihood fits on adjusted counts, greatly facilitating implementation. Like 
the maximum likelihood estimator, the reduced-bias estimator is found to respect the 
invariance properties that make cumulative link models a good choice for the anal- 
ysis of categorical data. Its additional finiteness and optimal frequentist properties, 
along with the adequate behaviour of related asymptotic inferential procedures make 
the reduced-bias estimator attractive as a default choice for practical applications. 
Furthermore, the proposed estimator enjoys certain shrinkage properties that are de- 
fensible from an experimental point of view relating to the nature of ordinal data. 
Key words: reduction of bias, adjusted score equations, adjusted counts, shrinkage, 
ordinal response models. 



1 Introduction 



In many models with categorical responses the maximum likelihood estimates can be on 
the boundary of the parameter space with positive probability. For example, Albert and 



Anderson (1984) derive the conditions that describe when the maximum likelihood esti- 



mates are on the boundary in multinomial logistic regression models. While there is no 
ambiguity in reporting an estimate on the boundary of the parameter space, as is for 
example an infinite estimate for the parameters of a logistic regression model, estimates 
on the boundary can (i) cause numerical instabilities to fitting procedures, (ii) lead to 
misleading output when estimation is based on iterative procedures with a stopping cri- 
terion, and more importantly, (hi) cause havoc to asymptotic inferential procedures, and 
especially to the ones that depend on estimates of the standard error of the estimators 
(for example, Wald tests and related confidence intervals). 



The maximum likelihood estimator in cumulative link models for ordinal data (Mo 



Cullagh, 1980) also has a positive probability of being on the boundary. 



Example 1.1: As a demonstration consider the example in Christensen (2012a, §7). The 



data set in Table [T] comes from Randall ( 1989 ) and concerns a factorial experiment for 
investigating factors that affect the bitterness of white wine. There are two factors in 
the experiment, temperature at the time of crushing the grapes (with two levels, "cold" 
and "warm") and contact of the juice with the skin (with two levels "Yes" and "No"). 
For each combination of factors two bottles were rated on their bitterness by a panel 



Table 1: The top table contains the wine tasting data (Randall, 1989) (top). The bottom 
table contains the maximum likelihood estimates for the parameters of model Q, the 
corresponding estimated standard errors (in parenthesis) and the values of the Z statistic 
(bottom) for the hypothesis that the corresponding parameter is zero. 



Temperature Contact 



Bitterness scale 
1 2 3 4 5 



Cold No 4 9 5 

Cold Yes 1 7 8 2 

Warm No 5 8 3 2 

Warm Yes 15 7 5 

Parameter ML estimates Z-statistic 



Oil 


-1.27 


(0.51) 


-2.46 


a 2 


1.10 


(0.44) 


2.52 


Oi3 


3.77 


(0.80) 


4.68 


Oti 


28.90 


(193125.63) 


0.00 


Pi 


25.10 


(112072.69) 


0.00 


P2 


2.15 


(0.59) 


3.65 


P 3 


2.87 


(0.82) 


3.52 


Pa 


26.55 


(193125.63) 


0.00 


6 


1.47 


(0.47) 


3.13 



of 9 judges. The responses of the judges on the bitterness of the wine were taken on a 
continuous scale in the interval from ("None") to 100 ("Intense") and then they were 
grouped correspondingly into 5 ordered categories, 1, 2, 3, 4, 5. 



Consider the partial proportional odds model (Peterson and Harrell, 1990) with 



log T ^ r ^ = a s - faw r - Qz r (r = l,...,4;s = l,...,4), (1) 

1 - Irs 

where w r and z r are dummy variables representing the factors temperature and contact, 
respectively, a±, . . . ,04, fa 9 are model parameters and j rs is the cumulative probabil- 
ity for the sth category at the rth combination of levels for temperature and contact. 



The elm function of the R package ordinal ( Christensen 2012b) is used to maximize 
the multinomial likelihood that corresponds to model ([Tj). The elm function finds the 
maximum likelihood estimates up to a specified accuracy, by using a Newton-Raphson 
iteration for finding the roots of the log-likelihood derivatives. For the current example, 
the stopping criterion is set to that the log-likelihood derivatives are less than 10 -10 in 
absolute value. The maximum likelihood estimates, estimated standard errors and the 
corresponding values for the Z statistics for the hypothesis that the respective parameter 
is zero, are extracted from the software output and shown in Table [TJ At those values for 
the maximum likelihood estimator the maximum absolute log-likelihood derivative is less 
than 10~ 10 and the software correctly reports convergence. Nevertheless, an immediate 
observation is that the absolute value of the estimates and estimated standard errors for 
the parameters 04, fa and fa is very large. Actually, these would diverge to infinity as the 
stopping criteria of the iterative fitting procedure used become stricter and the number of 
allowed iterations increases. 

For model ([!]) interest usually is on testing departures from the assumption of propor- 
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tional odds via the hypothesis fi\ = 02 = fis = fy. Using a Wald-type statistic would be 
adventurous here because such a statistic explicitly depends on the estimates of /3±, $2 , P3 
and /?4. Of course, given that the likelihood is close to its maximal value at the estimates 
in Table [TJ a likelihood ratio test can be used instead; the likelihood ratio test for the 



particular example has been carried out in Christensen (2012a, §7). 



Furthermore, the current example demonstrates some of the potential dangers involved 
in the application of cumulative link models in general; the behaviour of the individual 
Z statistics — being essentially for the parameters /3i and /?4 in this example — is 
quite typical of what happens when estimates diverge to infinity. The values of the Z 
statistics converge to zero because the estimated standard errors diverge much faster 
than the estimates, irrespective of whether or not there is evidence against the individual 
hypotheses. This behaviour is also true for individual hypotheses at values other than zero 
and can lead to invalid conclusions if the output is interpreted naively. More importantly, 



the presence of three infinite standard errors in a non-orthogonal (in the sense of Cox and 



Reid, 1987) setting like the current may affect the estimates of the standard errors for 



other parameters in ways that are hard to predict. □ 



An apparent solution to the issues mentioned in Example 1.1 is to use a different esti- 
mator that has probability zero of resulting in estimates on the boundary of the parameter 
space. For example, for the estimation of the common difference in cumulative logits from 
ordinal data arranged in a 2 x k contingency table with fixed row totals, McCullagh (1980) 
describes the generalized empirical logistic transform. The generalized empirical logistic 
transform has smaller asymptotic bias than the maximum likelihood estimator and is also 
guaranteed to give finite estimates of the difference in cumulative logits because it adjusts 
all cumulative counts by 1/2. However, the applicability of this estimator is limited to 
the analysis of 2 x k tables, and particularly in estimating differences in cumulative logits, 
with no obvious extension to more general cumulative link models, such as the one in 
Example |1.1| 

A family of estimators that can be used for arbitrary cumulative link models and 
are guaranteed to be finite are ridge estimators. As one of the referees highlighted, if 
one extends the work in le Cessie and van Houwelingen (1992) for logistic regression to 
cumulative link models, then the shrinkage properties of the ridge estimator can guarantee 
its finiteness. Nevertheless, ridge estimators have a series of shortcomings. Firstly, in 
contrast to the maximum likelihood estimator, the ridge estimators are not generally 
equivariant under general linear transformations of the parameters for cumulative link 
models. A reparameterization of the model by re-scaling the parameters or taking contrasts 
of those — which are often interesting transformations in cumulative link models — and 
a corresponding transformation of the ridge estimator will not generally result to the 
estimator that the same ridge penalty would produce for the reparameterized model, 
unless the penalty is also appropriately adjusted. For example, if one wishes to test the 
hypothesis in Example 1 1 . 1 1 using a Wald test, then an appropriate ridge estimator would be 
one that penalizes the size of the contrasts of /?i , 02 , and /?4 instead of the size of those 
parameters themselves. Secondly, the choice of the tuning parameter is usually performed 
through the use of an optimality criterion and cross-validation. Hence, the properties of the 
resultant estimators are sensitive to the choice of the criterion. For example, criteria like 
mean-squared error of predictions, classification error, and log-likelihood that have been 
discussed in le Cessie and van Houwelingen (1992) will each produce different results, as is 
also shown in the latter study. Furthermore, the resultant ridge estimator is sensitive to the 
type of cross-validation used. For example, /c-fold cross-validation will produce different 
results for different choices of k. Lastly, standard asymptotic inferential procedures for 
performing hypothesis tests and constructing confidence intervals cannot be used by simply 
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replacing the maximum likelihood estimator with the ridge estimator in the associated 
pivots. For the above reasons, ridge estimators can only offer an ad-hoc solution to the 
problem. 

Several simulation studies on well-used models for discrete responses have demon- 



strated that bias reduction via the adjustment of the log-likelihood derivatives (Firth 



1993) offers a solution to the problems relating to boundary estimates; see, for exam- 



ple, Mehrabi and Matthews (1995) for the estimation of simple complementary log-log 
models, |Heinze and Schemper| fl2002| ) and |Bull et al.| ( |2002[ ); |Kosmidis and Firth| (|2011 ) 
for binomial and multinomial logistic regression, respectively, and Kosmidis (2009) for 
binomial-response generalized linear models. 

In the current paper the aforementioned adjustment is derived and evaluated for the 
estimation of cumulative link models for ordinal responses. It is shown that reduction of 
bias is equivalent to a parameter-dependent additive adjustment of the multinomial counts 
and that such adjustment generalizes well-known constant adjustments in cases like the 
estimation of cumulative logits. Then, the reduced-bias estimates can be obtained through 
iterative maximum likelihood fits to the adjusted counts. The form of the parameter- 
dependent adjustment is also used to show that, like the maximum likelihood estimator, 
the reduced-bias estimator is invariant to the level of sample aggregation present in the 
data. 

Furthermore, it is shown that the reduced-bias estimator respects the invariance prop- 
erties that make cumulative link models an attractive choice for the analysis of ordinal 
data. The finiteness and shrinkage properties of the proposed estimator are illustrated via 
detailed complete enumeration and an extensive simulation exercise. In particular, the 
reduced-bias estimator is found to be always finite, and also the reduction of bias in cu- 
mulative link models results in the shrinkage of the multinomial model towards a smaller 
binomial model for the end-categories. A thorough discussion on the desirable frequentist 
properties of the estimator is provided along with an investigation of the performance of 
associated inferential procedures. 

The finiteness of the reduced-bias estimator, its optimal frequentist properties and 
the adequate performance of the associated inferential procedures lead to its proposal for 
routine use in fitting cumulative link models. 

The exposition of the methodology is accompanied by a parallel discussion of the cor- 
responding implications in the application of the models through examples with artificial 
and real data. 



2 Cumulative link models 

Suppose observations of n k- vectors of counts yx, . . . ,y n , on mutually independent multi- 
nomial random vectors Y±, . . . , Y n , where Y r = (Y r i, . . . , Y r k) T and the k multinomial cat- 
egories are ordered. The multinomial totals for Y r are m r = X^s=i Vrs and the probability 
for the sth category of the rth multinomial vector is 7r r fc, with X^s=i n rs = 1 (r = 1, . . . , n). 
The cumulative link model links the cumulative probability 7 rs = 7r. r i + . . . + 7r rs to a p- 
vector of covariates x r via the relationship 

Irs = G ^a s - Pt x rt\ = 1, • ■ • ,q; r = 1, . . . ,n) , (2) 

where q = k — 1 denotes the number of the non-redundant components of y r , and where 
<5 = (a\, . . . , a q , ... , (3 P ) T is a (p + g)-vector of real-valued model parameters, with 
«!<...< a q . The function G(.) is a monotone increasing function mapping (— oo,+oo) 
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to (0,1), usually chosen to be a known distribution function (like, for example, the lo- 
gistic, extreme value or standard normal distribution function). Then, ai,...,a q can be 
considered as outpoints on the latent scale implied by G. 

Special important cases of cumulative link models are the proportional-odds model 
with G(rj) = exp(ry)/{l + exp(?7)}, and the proportional hazards model in discrete time 
with G(rf) = 1 — exp {— exp(r/)} (see, McCullagh, 1980, for the introduction of and a 



thorough discussion on cumulative link models). 

The cumulative link model can be written in the usual multivariate generalized linear 
models form by writing the relationship that links the cumulative probability j rs to 8 as 

P+Q 

G' 1 (7 rs ) = rj rs = S t z rst (s = 1, . . . ,q; r = 1, . . . ,n) , (3) 
t=l 

where z rs t is the (s, t)th component of the q x (p + q) matrix 



1 
1 





x 



T 



In order to be able to identify 5, the matrix Z with row blocks Z\, . . . , Z n is assumed to 
be of full rank. 

Direct differentiation of the multinomial log-likelihood 1(5) gives that the tth compo- 
nent of the vector of score functions has the form 

n q , \ 

^=EE^ fe-^K- (t=l,...,p + q), (4) 

f^j^ \TTrs(5) TT rs+1 (5)J 



where g rs (5) = g(rj rs ), with g(rj) = dG(rj)/drj. If g(.) is log-concave then Ut( 5) = 
(t = 1, . . . , p + q) has unique solution the maximum likelihood estimate 5 (see, 



Pratt 



1981, where it is shown that the log-concavity of g(.) implies the concavity of 1(5)). 

All generalized linear models for binomial responses that include an intercept param- 
eter in the linear predictor are special cases of model §2§. 



3 Maximum likelihood estimates on the boundary 

The maximum likelihood estimates of the parameters of the cumulative link model can be 
on the boundary of the parameter space with positive probability. Under the log-concavity 
of <?(•), Haberman (1980) gives conditions that guarantee that the maximum likelihood 

[ "exist" in an alternative terminology). Boundary 

, (3 P with an 



estimates are not on the boundary 
estimates for these models are estimates of the regression parameters /3\, 



infinite value, and/or estimates of the outpoints — oo = ao < ot\ < ■ . . < a q < ap. = oo for 
which at least a pair of consecutive cutpoints have equal estimated value. 



As far as the regression parameters (3 are concerned, Agresti (2010, § 3.4.5) gives an 



accessible account on what data settings result in infinite estimates for the regression 
parameters, how a fitted model with such estimates can be used for inference and how 
such problems can be identified from the output of standard statistical software. 



As far as boundary values of the cutpoints a are concerned, Pratt ( 1981 ) showed that 
with a log-concave g(.), the cutpoints a s _i and a s have equal estimates if and only if the 
observed counts for the sth category are zero (s = 1, . . . , k) for all r G {1, . . . , n}. If the 
first or the last category have zero counts then the respective estimates for a\ and a q will 
be — oo and +co, respectively, and if this happens for category s for some s € {2, . . . , q}, 
then the estimates for a s _i and a s will have the same finite value. 
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4 Bias correction and bias reduction 



4.1 Adjusted score functions and first-order bias 

Denote by b(d) the first term in the asymptotic expansion of the bias of the maximum 
likelihood estimator in decreasing orders of information, usually sample-size. Call b(S) the 



first-order bias term, and let F(6) denote the expected information matrix for 5. Firth 

-F(6)b(S) then the solution of the adjusted score equations 



( fl993| ) showed that, if A(S) = 

U*{S) = U t (S) + A t (6) = (t = 1, . . . ,q + p) 



(5) 



results in an estimator that is free from the first-order term in the asymptotic expansion 
of its bias. 

4.2 Reduced-bias estimator 



Kosmidis and Firth (2009) exploited the structure of the bias-reducing adjusted score 
functions in ([5]) in the case of exponential family non-linear models. Using Kosmidis and 
Firth (2009, expression (9)) for the adjusted score functions in the case of multivariate 
generalized linear models, and temporarily omitting the argument S from the quantities 
that depend on it, the adjustment functions At in ^ have the form 

, n q 

At= -Y, m rY. tl \ Vr {( D ^r 1 ) s ®l q }V 2 (ir r ;Vr)]zrst (t = 1, ...,? + p) , (6) 



r=l 



s=l 



where V r = Z r F l Zj is the asymptotic variance-covariance matrix of the estimator for the 
vector of predictor functions r\ r = (ry r i, . . . , r] rq ) T and 7r r = (7i>i, . . . , 7r r q) T ■ Furthermore, 
V 2 (tt 

r ]rjr) is tri e q 2 x q matrix with sth block the hessian of 7v rs with respect to r] r 
(s = 1, . . . , q), In is the q x q identity matrix and is the q x q Jacobian of m r 7v, r with 
respect to rj r . A straightforward calculation shows that 



r 





gn 















~9rl 


9r2 ■ ■ 















~9t2 ' ' 






(r 








9rq-\ 















■ —9rq-l 


9rq . 





(r = 1, 



,n) 



The matrix E r is the incomplete qxq variance-covariance matrix of the multinomial vector 
Y r with (s,«)th component 



m r 7r rs (l — ir rs ) , s = u 

S / U 



(s,u = l,...,q;r = l,...,n). 



Substituting in ([6]), some tedious calculation gives that the adjustment functions At 
have the form 



n q 



r=l s=l 



■ V - ^ V - ^ / C rs C rs — l C rs +1 C rs . 
A t = Z^Z^9rs[ " " \Zrst (t=l,...,q + p) 



7T r 



Krs+1 



where 



Cro = c r fc = and c rs = -m r g' rs v rss (s = 1, . . . , q) , 



(7) 



(8) 
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with g' rs = g'(r] rs ), and g'(rj) = d 2 G(r])/dr] 2 , The quantity v rss is the sth diagonal compo- 
nent of the matrix V r (s = 1, . . . , q; r = 1, . . . , n). 

Substituting Q and ^ in ^ gives that the tth component of the bias-reducing 
adjusted score vector (i = 1, . . . , q + p) has the form 



n q 



Crs— 1 (5) + C rs+ i(8) - C rs (S) 



7T rs (5) 



Zrst • (9) 



The reduced-bias estimates Srb are such that U^{8rb) = for every t € {1 = 1, . . . , g+p}. 



Kosmidis (2007a Chapter 6) shows that if the maximum likelihood is consistent, then the 



reduced-bias estimator is also consistent. Furthermore, Srb has the same asymptotic 
distribution as 6, namely a multivariate Normal distribution with mean S and variance- 
covariance matrix F^ 1 (S). Hence, estimated standard errors for 6rb can be obtained 
as usual by using the square roots of the diagonal elements of the inverse of the Fisher 
information at Srb- All inferential procedures that rely in the asymptotic Normality of 
the estimator can directly be adapted to the reduced-bias estimator. 



4.3 Bias-corrected estimator 



Expression ([7j) can also be used to evaluate the first-order bias term as b(S) = —F~ 1 (6)A(S), 
where F(S) = Ylr=i Zj D r {8)T l ~ 1 {8)D^ {8)Z r . If 5 is the maximum likelihood estimator 
then 

S BC = 6 - b(S) (10) 

is the bias-corrected estimator which has been studied in Cordeiro and McCullagh ( 1991| ) 
for univariate generalized linear models. The estimator Sbc can be shown to have no 



first-order term in the expansion of its bias (see, Efron, 1975, for analytic derivation of 
this result). 



4.4 Models for binomial responses 

For k = 2, Y r i has a Binomial distribution with index m and probability ir r x, and Y r 2 
m r — Y r \. Then model ^ reduces to the univariate generalized linear model 

v 

G{ix r ) = a - Pt&rt (r = 1, ... ,n) . 
t=i 

From ([9]), the adjusted score functions take the form 

tt*(x\ rx\ f vn + cn{8) m r -y T i - c Tl {5) \ 

U t ("J = Z^Srl{0) <j ^ ^ } Z r it (t = l,...,p+l) 



r=l 



7T rl (<5) 



1 - 7T r l(5) 



Omitting the category index for notational simplicity, a re-expression of the above equality 
gives that the adjusted score functions for binomial generalized linear models have the form 



r=l 



u t( S ) = XT n (\_ ~ ^ ( Vr + ^T hr ~ mr7Tr ) Zrt ( t = 1 '---'P+ !) 



2w r 



(11) 



where w r = m r g 2 / '{7r r (l — 7r r )} are the working weights and h r is the rth diagonal com- 
ponent of the "hat" matrix H = ZF~ 1 Z T W , with W = diagjifi, . . . ,w n } and 
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The above expression agrees with the results in Kosmidis and Firth (2009, §4.3), where it 



is shown that for generalized linear models reduction of bias via adjusted score functions is 
equivalent to replacing the actual count y r with the parameter-dependent adjusted count 
y r + g' r hr/ (2w r ) (r = 1, . . . , n). 



5 Implementation 

5.1 Maximum likelihood fits on iteratively adjusted counts 

When expression Q is compared to expression Q, it is directly apparent that bias- 
reduction is equivalent to the additive adjustment of the multinomial count y rs by the 
quantity c rs (d) — c rs -i($) {s = 1, . . . , k; r = 1, . . . , n). Noting that these quantities depend 
on the model parameters in general, this interpretation of bias-reduction can be exploited 
to set-up an iterative scheme with a stationary point at the reduced-bias estimates: at each 
step, i) evaluate y TS + c rs (8) — c rs -i(8) at the current value of S (s = 1, . . . , q; r = 1, . . . , n), 
and ii) fit the original model to the adjusted counts using some standard maximum like- 
lihood routine. 

However, c rs (d) — c rs -i(S) can take negative values which in turn may result in fitting 
the model on negative counts in step ii) above. In principle this is possible but then the 
log-concavity of g(.) does not necessarily imply concavity of the log-likelihood function 



and problems may arise when performing the maximization in ii) (see, for example, Pratt 



1981, where the transition from the log-concavity of g(.) to the concavity of the likelihood 
requires that the latter is a weighted sum with non- negative weights). That is the reason 
why many published maximum likelihood fitting routines will complain if supplied with 
negative counts. 

The issue can be remedied through a simple calculation. Temporarily omitting the 
index r, let a s = c s — c s _i (s = 1, . . . , k). Then the kernel (y s + a s )/ir s — (y s + a s+ i)/7r s+ i 
in ([9]) can be re-expressed as 

y s + a s I(a s > 0) - ir s a s+1 I(a s+ i < 0)/7r s+ i y s+ i + a s+1 I(a s+1 > 0) - ir s+1 a s I(a s < 0)/ir s 



7T S 7Ts+l 

where 1(E) = 1 if E holds and otherwise. Note that 

a s (8)I{a s {8) > 0) - 7r s (S)a s+1 (d)I(a s+1 (S) < 0)/vr s+1 (<5) > , 

uniformly in S. Hence, if step i) in the above procedure adjusts y rs by a rs I(a rs > 0) — 
ir rs a rs+ il(a rs+ i < 0)/ir rs+ i evaluated at the current value of S, then the possibility of 
issues relating to negative adjusted counts in step ii) is eliminated, and the resultant 
iterative procedure still has a stationary point at the reduced-bias estimates. 

5.2 Iterative bias correction 

Another way to obtain the reduced- bias estimates is via the iterative bias-correction pro- 
cedure of Kosmidis and Firth (2010); if the current value of the estimates is 5^' then the 
next candidate value is calculated as 

= S {1+1) -b(s^) (i = 0,l,...), (12) 



where S (l+1) = $ (<) + F' 1 (d®) u(s®), that is S is the next candidate value for 



the maximum likelihood estimator obtained through a single Fisher scoring step, starting 
from tfW. 
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Iteration (12) generall y re quires more effort in implementation than the iteration de- 

Nevertheless, if the starting value is chosen to be the 



5.1 



scribed in the Subsection 
maximum likelihood estimates then the first step of the procedure in (|12|) will result in 



the bias-corrected estimates defined in (10). 



6 Additive adjustment of the multinomial counts 
6.1 Estimation of cumulative logits 

For the estimation of the cumulative logits a s = log{7 s / (1— 7s)} (s = 1, . . . , q) from a single 
multinomial observation y\, . . . , y^ the maximum likelihood estimator of a s (s = 1, . . . , q) 
is a s = log{R s /(m — R s )}, where R s = Ylj=i Y s is the sth cumulative count. The Fisher 
information for a\, . . . ,a q is the matrix of quadratic weig hts W = DY,- l D T . The matrix 
W is symmetric and tri-diagonal with non-zero components 



W s _i 



W ss = m 7 s (1 " 7s) 2 ( 1 + 1 

7s-i(l-7s-i)7s(l-7s) 



{8 = 1,... 



-ni- 



ls ~ Is- 1 



s = 2,...,q) 



1. By use of the recursion formulae in Usmani (1994) for the inversion 



with 70 = and 7^ 

of a tri-diagonal matrix, the sth diagonal component of F~ x = W~ x is l/(m7 s (l — 7 S )). 
Hence, using ^ and noting that g s = j s (l — 7 S )(1 — 27 s ) for the logistic link, c s = \ — j s 
(s = l,...,q). Substituting in ^ yields that reduction of bias is equivalent to adding 
1/2 to the counts for the first and the last category and leaving the rest of the counts 
unchanged. 

The above adjustment scheme reproduces the empirical logistic transforms a s = log{(i? s + 
l/2)/(m — R s + 1/2)}, which are always finite and have smaller asymptotic bias than a s 
(see Cox and Snell , [1989 §2.1.6, under the fact that the marginal distribution of R s given 
R k 



m is Binomial with index m and probability j s for any s G {1, . . . , q}). 



6.2 A note of caution for constant adjustments in general settings 



Since the works of Haldane ( 1955 ) and Anscombe ( 1956 ) concerning the additive modifi 



cation of the binomial count by 1/2 for reducing the bias and guaranteeing finiteness in 
the problem of log-odds estimation, the addition of small constants to counts when the 
data are sparse has become a standard practice for avoiding estimates on the boundary 



of the parameter space of categorical response models (see, for example Hitchcock, 1962 



Gart and Zweifel, 1967 Gart et al. , 1985 Clogg et al. , 1991 ). Especially in cumulative link 



models where g(.) is log-concave, if all the counts are positive then the maximum likelihood 



estimates cannot be on the boundary of the parameter space (Haberman, 1980) 



Despite their simplicity, constant adjustment schemes are not recommended for general 
use for two reasons: 



1. Because the adjustments are constants, the resultant estimators are generally not 
invariant to different representations of the data (for example, aggregated and dis- 
aggregated view), a desirable invariance property that the maximum likelihood esti- 
mator has, and which allows the practitioner not to be concerned with whether the 
data at hand are fully aggregated or not. 
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Table 2: Two alternative representations of the same artificial data set. 



Y 



X 


1 


2 


3 


4 


1/2 


8 


6 


1 





1/2 


10 





1 





1/2 


8 


1 













Y 






X 


1 


2 


3 


4 


-1/2 


8 


6 


1 





1/2 


18 


1 


1 






Example 6.1: For example, consider the two representations of the same data in 
Table [2} Interest is in estimating the difference /5 between logits of cumulative 
probabilities of the samples with x = —1/2 from the samples with x = 1/2. 

The maximum likelihood estimate of a 3 is +oo. Irrespective of the data represen- 
tation the maximum likelihood estimate of (3 is finite and has value —1.944 with 
estimated standard error of 0.895. Now suppose that the same small constant, say 
1/2, is added to each of the counts in the rows of the tables. The adjustment en- 
sures that the parameter estimates are finite for both representations. Nevertheless, 
a common constant added to both tables causes — in some cases large — differences 
in the resultant inferences for j3. For the left table the maximum likelihood estimate 
of (3 based on the adjusted data is —1.097 with estimated standard error of 0.678, 
and for the right table the estimate is —1.485 with estimated standard error of 0.741. 
If Wald-type procedures were used for inferences on (3 with a Normal approxima- 
tion for the distribution of the approximate pivot (/3 — f3)/S((3), where <S(/3) is the 
asymptotic standard error at (3 based on the Fisher information, then the p-value of 
the test /3 = would be 0.106 if the left table was used and 0.045 if the right table 
was used. 



2. Furthermore, the moments of the maximum likelihood estimator generally depend on 



the parameter values (see, for example Cordeiro and McCullagh 1991, for explicit 
expressions of the first-order bias term in the special case of binomial regression 



models) and thus, as is also amply evident from the studies in Hitchcock (1962) and 



Gart et al. (1985), there cannot be a universal constant which yields estimates which 



are optimal according to some frequentist criterion. 

Both of the above concerns with constant adjustment schemes are dealt with by using the 
additive adjustment scheme in Subsection 5.1 Firstly, by construction, the iteration of 



Subsection |5.1| yields estimates which have bias of second-order. Secondly, because the 
adjustments depend on the parameters only through the linear predictors which, in turn, 
do not depend on the way that the data are represented, the adjustment scheme leads to 
estimators that are invariant to the data representation. For both representations of the 
data in Table[2]the bias-reduced estimate of (3 is —1.761 with estimated standard error of 
0.850. 



7 Invariance properties of the reduced-bias estimator 
7.1 Equivariance under linear transformations 

The maximum likelihood estimator is exactly equivariant under one-to-one transforma- 
tions 4>{.) of the parameter 8. That is if 8 is the maximum likelihood estimator of 8 then, 
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the maximum likelihood estimator of (f>(S) is simply 4>(d). In contrast to d, the reduced- 
bias estimator 5rb is not equivariant for all cf>; bias is a parameterization-specific quantity 
and hence any attempt to improve it can violate exact equivariance. Nevertheless, Srb is 
equivariant under linear transformations cj)(S) = Ld, where L is a (p + q) X (p + q) matrix 
of constants such that ZL is of full rank and 6' = LS has ol-y < . . . < a' q . 

To see that, assume that one fits the multinomial model with 7 rs = G(rj' rs ) where 
Vrs = Yl P t=l &'t z rst (t = 1, • • • ) n, s = 1, . . . , q). Because S' = LS, rj' rs is a linear combination 
of S. Using expression ([9]), the tth component of the adjusted score function for S' is 

U t=2^2^9rs\ ~j ~, Zrst , (13) 

, r=1 s=1 I n rs n rs+l ) 

for t € {1, . . . ,p + q}, where d rs , 7r', s , g' rs are evaluated at 5' . Note that all quantities in 



(13) depend on <5' only though the linear combinations r]' rs . Thus, comparing (g to 
if 8rb is a solution of C/ t * =0 = 1, . . . ,p + g), then LSrb must be a solution of U[ = 
{t = l,...,p + q). 

The bias-corrected estimator defined in ( 10 ) can be shown also to be equivariant under 
linear transformations, using the equivariance of the maximum likelihood estimator and 
the fact that b(S) depends on 5 only through the linear predictors. 



7.2 Invariance under reversal of the order of categories 

One of the properties of proportional-odds models, and generally of cumulative link models 
with a symmetric latent distribution G(.) is their invariance under the reversal of the order 
of categories; a reversal of the categories along with a simultaneous change of the sign of 
(3 and change of sign — and hence order — to ai, . . . , a q in model ^ results in the same 
category probabilities. Given the usual arbitrariness in the definition of ordinal scales in 
applications this is a desirable invariance property for the analysis of ordinal data. 

The maximum likelihood estimator respects this invariance property. That is if the 
categories are reversed then the new fit can be obtained by merely using —(3 ML for the 
regression parameters and (— a g , . . . , — a{) for the cutpoints. 

The reduced-bias estimator respects the same invariance property, too. To see this, as- 

1, . . . ,n,s = 



sume that one fits the multinomial model with 1 — j rs = G(a.k- S — P x r) ( r 
l,...,q) with a,\ < . . . < a g . Because g(.) is symmetric about zero, G(r]) = 1 
G(-a k _ s + p T 



and so "y rs 
Hence, 8' = LS with 



G{Yltti S'tZrst) where 



This is a reparameterization of model (p 



8 1 = («;,.. .,a' /3[,...,P') T = (- 



■a, 



-at, -P\ 



-G(-v), 
to 7 rs = 

• , -P P ) T . 



L 




-1 






and based on the results of Subsection 



7.1 



RB 



LS RB (and also 5 B c = L5bc) 



8 Properties of the reduced-bias estimator and associated 
inferential procedures: a complete enumeration study 



8.1 Study design 

The frequentist properties of the reduced-bias estimator are investigated through a com- 
plete enumeration study of 2 x k contingency tables with fixed row totals. The rows of 
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the tables correspond to a two-level covariate x with values x\ and X2, and the columns 
to the levels of an ordinal response Y with categories 1, . . . , k. The row totals are fixed 
to m\ for x = x\ and to m 2 for x = x 2 . The right table in Table [2] is a special case with 
k = 4, x% = —1/2, X2 = 1/2, and row totals m\ = 15, m 2 = 20. The present complete 
enumeration involves ( mi+ i\ ( m 2+i\ tables. We consider a multinomial model with 

V mi ) \ m,2 I 

7is = G(a s - Pxi) , (14) 
7 2s = G(a s - /3x 2 ) {s = l,...,q), 

where ai,...,a q are regarded as nuisance parameters but are essential to be estimated 
from the data, because they allow flexibility in the probability configurations within each 
of the rows of the table. 

For the estimation of f3 we consider the maximum likelihood estimator /3, the bias- 
corrected estimator $bc, the reduced-bias estimator 0r.r, an d the generalized empirical 



logistic transform Pel which is defined in McCullagh (1980, §2.3) and is an alternative 



estimator with smaller asymptotic bias than the maximum likelihood estimator specifically 
engineered for the estimation of /3 in 2 x k tables with fixed row totals. The estimators 
P, Pbc and Prb are the /3-components of the vectors of estimators 5, 8bc and <$rb, 
respectively, where S = (ai, . . . , a q , (3) T is the vector of all parameters. The estimators are 
compared in terms of bias, mean-squared error and coverage probability of the respective 
Wald-type asymptotic confidence intervals. The following theorem is specific to 2 x k and 
cumulative link models, and can be used to reduce the parameter settings that need to be 
considered in the current study for evaluating the performance of the estimators. 

Theorem 8.1: Consider a2xk contingency table T with fixed row totals mi and m 2 , and 



the multinomial model that satisfies (14)- Furthermore, consider an estimator 5*(T) of 
d, which is equivariant under linear transformations. Then if mi = m 2 , the bias function 
and the mean squared error of j3*(T) satisfy 

E{p*(T) - P; P, a) = -E{p*(T) + P; -p, a) , and 

E {(P*(T) - P) 2 ;p,a} = E {(p*(T) + P) 2 ;-P,a} , respectively. 

Proof. Define an operator R which when applied to T results in a new contingency table 
by reversing the order of the rows of T. Hence, R(R(T)) = T. 

Because S*(T) is equivariant under linear transformations, it suffices to study the 
behaviour of P*(T) when x\ = —1/2 and x 2 = 1/2. Then, any combination of values for 
x\ and x 2 results by an affine transformation of the vector (—1/2, 1/2), and equivariance 
gives that a corresponding translation of the vector ct* (T) and change of scaling of P* (T) 
results in exactly the same fit. Hence, the shape properties of P*(T) remain invariant to 
the choice of {xi,x 2 ) T ■ 

Denote with T the set of all possible 2 x k tables with fixed row totals m\ and m 2 . By 
the definition of the model, P(T; P, a.) = P(R(T); — /3, a) for every T E T ■ Because m\ = 
m 2 there is a subset £ C T of tables with (yu, . . . , y\k) = (2/21, • • • > V2k)- The complement 
of £ can be partitioned into the sets J-\ and J- 2 which have the same cardinality, and where 
T G F\ if and only if R(T) £ T 2 . For x\ = —1/2 and x 2 = 1/2, equivariance under the 
linear transformation 4>{P) = ~P gives that P*(T) = —P*(R(T)). Then, for any T € £, 



12 



13* (T) = 0. Hence, 



E(f3*(T);f3, a ) = J2P*(T)P(T;P,<x) (15) 

T<££ 

= P*(F){P(T;P,a)-P{R(T);P,a)} 

= Yl Z 3 *^) i P ( R (n -P, ct) - P(T; -(3, a)} = -E(f3*(T); -/?, a) 

Adding —j3 to both sides of the above equality gives the identity on the bias. For the 
identity on the mean squared error one merely needs to repeat a corresponding calcu- 



lation to (|l5j) starting from E{((3*(T) - /3) 2 ;/3,q} = £ T g £ 09*(T) ~ PrP(T;P,a) + 
/3 2 £ T6£ P(T;/W □ 

A similar line of proof can be used to show that if m\ = rri2 the coverage probability 
of Wald-type asymptotic confidence intervals for (3 is symmetric about f3 = 0, provided 
that the estimator S(T) of the standard error of (3*{T) satisfies S(T) = S(R(T)). 

8.2 Special case: Proportional odds model 

For demonstration purposes, the values of the competing estimators are obtained for a 
proportional odds model (G(r/) = exp(ry)/{l + exp(ry)}) with x\ = —1/2 and X2 = 1/2 and 
k = 4, for each of the 400, 3136 and 81796 possible tables with row totals m = m\ = m2, for 
m = 3, m = 5, and m = 10, respectively. All estimators considered are equivariant under 



linear transformations and hence, according to the proof of Theorem 8.1, the outcome of 
the complete enumeration for the comparative performance of the estimators generalizes 
to any choice of {x\,X2) T ■ 

The estimators j3 and Pub are not available in closed form and one needs to rely 
on iterative procedures for finding the roots of Ut(S) and U£(5), respectively, for every 
t £ {1,2,3,4}. Fisher scori ng is used to obtain and the iterative maximum likelihood 



approach of Subsection 5.1 is used for (3rb- The maximum likelihood estimate is judged 
satisfactory if the current value 8° of the iterative algorithm satisfies |[/t(<5 c )| < 10~ 10 for 
every t E {1, 2, 3, 4}. For (3rb, the latter criterion is used with U£ in the place of Ut- 
For evaluating the performance of the estimators, the probability of each of the tables 



has been calculated under model (14), for parameter values that are fixed according to 
the following scheme. The parameter /3 takes values on some sufficiently fine equi-spaced 
grid in the interval [—6,0]. For /3 in the interval (0,6] the results can be predicted by the 
symmetry relations of Theorem |8.1| For each value of f3, the nuisance parameters take 
values (an, a 2 ,a 3 ) T = e(-l,0,l) T for e G {1,2,3,5,7}. Figure is a pictorial representa- 
tion of the probability settings for the two multinomial vectors in the 2x4 contingency 
table with fixed row totals, at each combination of values for (3 and (a\, ct2, ol-^) t . Under 
the above scheme for fixing parameter values, the probability of the end categories tends 
to zero as e increases, and hence more extreme probability settings are being considered 
as e grows. 

The findings of the current complete enumeration exercise are outlined in the following 
Subsection. The same complete enumeration design has been applied to a number of 
settings, with m\ ^ rri2, with different link functions, with different numbers of categories, 
and/or for different non-symmetric specifications for the nuisance parameters (results not 
shown here) yielding qualitatively the same conclusions; the current setup merely allows a 
clear pictorial representation of the findings on the behaviour of the reduced-bias estimator. 
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Figure 1: A pictorial representation of the probability settings considered in the calcu- 
lation of expectations from the complete enumeration study. The left hand side of each 
plot depicts the multinomial probabilities for x = —1/2 and the right the multinomial 
probabilities for x = 1/2. The 8 probabilities (4 for each x value) for each particular 
combination of values for j3 and (011,012,013) are connected with line segments. Hence each 
piecewise linear function on each plot corresponds to a specific probability setting for the 
2x4 contingency table with fixed row totals. The plots correspond to particular settings 
for the nuisance parameters (ai, (22,013) determined by e(— 1,0, 1), and each plot contains 
all possible piecewise linear functions for the values of (3 on an equi-spaced grid of size 50 
in the interval [—6,6]. 




cy cy cy cy cy cy £>' cy cy cy cy cy 5S' cy cy cy cy cy cy cy cy cy cy cy 



Covariate value | Category 



An R script that can produce the results of the current complete enumeration for any 
number of categories, any link function, any configuration of totals and any combination 
of parameter settings in 2 x k contingency tables is available in the supplementary material. 

8.3 Remarks on the results 

Remark 1. On the estimates of at, «2 and 03: According to Section |3j for data sets 
where a specific category s E {1, 2, 3, 4} is observed for neither x = —1/2 nor x = 1/2, the 
maximum likelihood estimate of a is on the boundary of the parameter space as follows: 



s 


= 1 : 


ax 


= —00 


s 


= 2 : 


a-2 


= &i 


s 


= 3 : 


0:3 


= 612 


s 


= 4 : 


0:3 


= +00 



A least for log-concave g( ), according to the results in Pratt (1981), the above equa- 
tions extend directly to the case of any number of categories and number of covariate 
settings and can directly be used to check what happens when two or more categories are 
unobserved. 

Nevertheless, the maximum likelihood estimator of (3 is invariant to merging a non- 
observed category with either the previous or next category and can be finite even if 
some of the a parameters are on the boundary of the parameter space. Hence, maximum 
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Figure 2: Probability of infinite estimates (top), conditional biases (middle) and condi- 
tional mean squared errors (bottom) of $ and 0bc for the parameter settings considered 
in the complete enumeration study. 
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likelihood inferences on (3 are possible even if a category is not observed. The same 
behaviour is observed for the reduced-bias estimators of cti, 02, 03 with the difference that 
if the non-observed category is s = 1 and/or s = 4, then ck^bb and/or o^bb are finite. 
A special case of this observation has been encountered in Subsection |6.1| where reduction 
of the bias corresponds to adding 1/2 to the end categories, guaranteeing the finiteness 
of the cumulative logits. Hence, there is no need for non-observed end-categories to be 
merged with the neighbouring ones when the reduced-bias estimator is used. If any of 
the other categories is empty, then the reduced-bias estimator of f3 is invariant to merging 
those with any of the neighbouring ones. 

It should be mentioned here that if both the second and the third category are empty 
then the reduced-bias estimate of (3 and the generalized empirical logistic transform are 
identical. To see that, note that in the special case of logistic regression, the adjusted 



scores in Subsection 4.4 suggest adding half a leverage to each of y T \ and y r2 [r = 1,2) 



(this result for logistic regressions was obtained in Firth, 1993). Furthermore, the model 
with q = 1 is saturated and hence both leverages are 1. Hence the reduced-bias estimate of 
(3 coincides with the generalized empirical logistic transform, which for k = 2 is log{(yn + 
l/2)/(mi - yn + 1/2)} - log{(y 2 i + l/2)/(m 2 - y 2 i + 1/2)}. 

Remark 2. On f3 and Pbc : As is expected from the discussion in Section [3J the 
maximum likelihood estimator of (3 is infinite for certain configurations of zeros on the 
table, and for such configurations the bias-corrected estimator is also undefined owing to 
its explicit dependence on the maximum likelihood estimator. Hence, for f3 and /3bc> the 
bias function is undefined and the mean squared error is infinite. A possible comparison 
of the performance of f3 and (3bc is m terms of conditional bias and conditional mean 
squared error where the conditioning event is that f3 has a finite value. 

For detecting parameters with infinite values the diagnostics in Lesaffre and Albert 



(1989, §4) for multinomial logistic regressions are adapted to the current setting. Data 
sets that result in infinite estimates for (3 have been detected by observation of the size of 
the corresponding estimated standard error based on the inverse of the Fisher information, 
and by observation of the absolute value of the estimates when the convergence criteria 
were satisfied. If the standard error was greater than 200 and the estimate was greater 
than 100, then the estimate was labelled infinite. A second pass through the data sets 
has been performed making the convergence criterion for the Fisher scoring stricter than 
\U t (d c )\ < 10~ 10 . The estimates that were labelled infinite using the aforementioned 
diagnostics, further diverged towards infinity while the rest of the estimates remained 
unchanged to high accuracy. 

The probability of encountering an infinite (3 for the different possible parameter set- 
tings is shown at the top row of Figure [2j For (3 E (0, 6) the probability of encountering 
an infinite value is simply a reflection of the probability in (—6,0). As is apparent the 
probability of infinite estimates increases as e increases and for each value of e it increases 
as \(3\ increases. As is natural as m increases, the probability of encountering infinite 
estimates is reduced but is always positive. 

Of course, the findings from the current comparison of f3 with /3bc should be interpreted 
critically, bearing in mind the conditioning on the finiteness of /?; the comparison suffers 
from the fact that the first-order bias term that is required for the calculation of (3bc 
is calculated unconditionally. The comparison is fairer when the probability of infinite 
estimates is small; this happens on a region around zero whose size also increases as m 
increases. 

The conditional bias and conditional mean squared error of (3 and (3bc are shown in 
the left and right of the second row of Figure [2] The identities in Theorem |8.1| apply 
also for the conditional and conditional mean squared error; to see this set P to be the 
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Figure 3: Biases (top) and mean squared errors (bottom) of Pel and Pre for the parameter 
settings considered in the complete enumeration study. 
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conditional probability of each table in the proof of Theorem 8.1 Hence, for (3 G (0,6), 
the conditional bias is simply a reflection of the conditional bias for P £ (—6,0) across the 
45° line, and the conditional mean squared error is a reflection of the conditional mean 
squared error for f3 £ (—6, 0) across /3 = 0. 

The behaviour of the estimators in terms of conditional bias is similar, with the maxi- 
mum likelihood estimator performing slightly better than Pec f° r small m. As m increases 
the bias corrected estimator starts performing better in terms of bias in a region around 
zero, where the probability of infinite estimates is smallest. The same is noted for the 
conditional mean squared error. The estimator (5bc performs better than f3 in a region 
around zero, whose size increases as m increases. The same behaviour as for e = 7 persists 
for larger values of e (figures not shown here). 

Remark 3. On Pel and $rb' The estimators Pel and Prb, always have finite value 
irrespective of the configuration of zeros on the table. Hence, in contrast to $ and Pbc-, 
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Figure 4: Coverage probabilities of nominally 95% asymptotic Wald-type confidence in- 
tervals for (3, based on (3 and (3bc (top) and (3 el and (3rb (bottom) and the respective 
standard errors, for (3 G [-10, 0) and a = e(-l, 0, 1) T for e G {1, 2, 3, 5, 7}. 
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a comparison in terms of their unconditional bias and unconditional mean squared error 
is possible. The left plot of Figure [3] shows the bias function of the estimator for the 
parameter settings considered in the complete enumeration study. For f3 G (0, 6), the bias 
function is simply a reflection of the bias for f3 G (—6, 0) across the 45° line, and the mean 
squared error is a reflection of the mean squared error for j3 G (—6, 0) across (3 = 0. 

The reduced-bias estimator performs better than (3 el m terms of bias for small values 
of e and the differences in the bias functions diminish as e increases. A similar limiting 
behaviour holds for their mean squared errors, though for small values of e, /3el performs 
slightly better than (3br in terms of mean squared error in the range (—4, 4) and worse 
outside that range. The mean squared error of both estimators converges to zero as m 



increases, which is what is expected from consistent estimators (see, Kosmidis, 2007a, §6.3 
for a proof of the consistency of the reduced-bias estimator) . 
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Remark 4. On the coverage of 95% Wald confidence intervals For a table T and 
an estimator f3*(T), consider the nominally 100(1 — a)% Wald-type confidence interval for 

P 

f3*(T)± Zl _ a/2 S*(T), 

where z a is the lOOath quantile of a standard normal distribution and S*(T) is the es- 
timator of the standard error of (3*{T). For /3, Pbc and Prb, S*(T) is taken to be the 
square root of the diagonal element of the inverse of the Fisher information corresponding 
to (3, evaluated at $(T), Pbc(T) and /3rb(T), resp ectively. For the es timation of the stan- 



dard error for Pel, the variance formula given in |McCullagh (1980, §2.3) is used. If the 



maximum likelihood estimate is infinite then we make the convention that the confidence 
intervals based on /3 and Pbc are (— oo, oo). Figure [4] shows the coverage probabilities of the 
four competing intervals for a = e(— 1,0, 1) T with e G {1,2,3,5,7}, and for /3 G [—10,0]. 
The coverage probability for f3 G (0, 10) is simply a reflection of the coverage probability 
for p G (-10,0) across ,9 = 0. 

Wald-type confidence intervals based on the maximum likelihood estimator demon- 
strate a conservative behaviour in terms of coverage, and the coverage probability con- 
verges to 1 as |/3| — > 00. Furthermore, the coverage probability seems to uniformly get 
closer to the nominal level as m increases. The intervals based on the bias-corrected esti- 
mator also demonstrate a conservative behaviour in a neighbourhood around /3 = 0, then 
tend to undercover for an interval of large |/3| values, and as for /3, when |/3| — > 00 the 
coverage probability tends to 1. 

A more dramatic undercoverage is present for confidence intervals based on Pel when 
|/?| is large. Actually after some value of \f3\ the confidence intervals based on Pel com- 
pletely lose coverage (the full range of the coverage probability is not shown here). On 
the other hand, those intervals behave satisfactorily around (3 = 0. This behaviour re- 
lates to the fact that the variance estimator for Pel is obtained under the assumption 
that P = and can seriously underestimate the variance of Pel when |/3| is larger than 



about 1 (the same observation is also made in McCullagh, 1980, §2.3). Furthermore, it is 



worth noting that the point where coverage is lost completely moves closer to zero as m 
increases. Hence, use of Wald-type confidence intervals based on Pel is not recommended 
in practical applications. 

Apart from being conservative, confidence intervals based on Pub seem to behave better 
for a wider range of P around zero, but also completely lose coverage after some value of 
\P\. The complete loss of coverage for large effects is due to an interplay of discreteness 
of the response and the fact that Prb and Pel take always finite values. Specifically, 
for any finite m there is only a finite number of possible Wald-type confidence intervals 
because the response is multinomially distributed, and any of those confidence intervals 
has finite endpoints. Therefore, there will always be a large enough value of \P\ which is 
not contained in any of the confidence intervals resulting to a complete loss of coverage. 
Nevertheless, in contrast to Pel, the coverage properties of the Wald-type confidence 
intervals based on Prb improve quickly and the value where coverage is lost moves quickly 
away from zero as m increases. This is because the cardinality of the set of the possible 
confidence intervals increases and the approximation of the necessarily discrete distribution 
of the reduced-bias estimator by a Normal distribution with variance the inverse of the 
Fisher information gets more accurate. This results in the increasing accuracy of the 
approximation of the distribution of the Wald-pivot by a Normal distribution. 

As the current study demonstrates the Wald-type confidence intervals based on any of 
the estimators do not behave satisfactorily for the whole range of P and for small sample 
sizes. For this reason current research focuses on alternative confidence intervals that can 
have one infinite endpoint (see Section 12). Until conclusive results are produced, Wald- 
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Table 3: Parameter estimates and corresponding estimated standard errors (in paren- 
thesis) from fitting a proportional odds model and a proportional hazards model of the 
form ( |14[ ) to the artificial data considered in Example 6.1, using maximum likelihood and 
bias reduction. 



Model 


Parameter 


Maximum likelihood 


Bias reduction 




P 


-1.944 (0.895) 


-1.761 (0.850) 


Proportional odds 




1.187 (0.449) 


1.084 (0.428) 




a 2 


3.096 (0.787) 


2.781 (0.701) 




a 3 


oo (oo) 


4.457 (1.440) 




P 


-0.689 (0.401) 


-0.635 (0.389) 


Proportional hazards 


Oil 


0.313 (0.220) 


0.297 (0.219) 


(G(ry) = 1 - exp{- exp(?7)}) 


a 2 


1.097 (0.260) 


1.013 (0.246) 




a 3 


oo (oo) 


1.518 (0.357) 



type confidence intervals based on the reduced-bias estimator can still be used in practice 
as asymptotically correct, bearing in mind that, they will be generally slightly conservative 
for moderate effects (like the ones based on the maximum likelihood estimator) especially 
in small samples, and also that their coverage properties will deteriorate for extremely 
large effects. 



9 Shrinkage towards a binomial model for the end-categories 

Table [3] shows the maximum likelihood estimates, the reduced-bias estimates and the 
corresponding estimated standard errors from fitting a proportional odds model and a 



proportional hazards model of the form (14) to the artificial data considered in Exam- 



ple 6.1 



There is apparent shrinkage of the reduced-bias estimates towards zero, which implies 
a shrinkage of the cumulative probabilities towards G(0). This implies a shrinkage of 
the probabilities for the first and the last category of the ordinal scale towards G(0) and 
1 — G(0) respectively, and a corresponding shrinkage of the probabilities of the intermediate 
categories towards zero. 

To investigate further the apparent shrinkage effect, the maximum likelihood and 



reduced-bias estimates of proportional odds and proportional hazards models of the form ( 14 ) 
are obtained for every possible 2x6 table with row totals mi = m-i = 3. This setting is 
chosen because it is one that results in sparse tables, allowing the construction of plots of 
fitted probabilities that are not massively overcrowded (under this setting there are 3136 
tables to be estimated). 

For each category of the ordinal response, Figure [5] shows the fitted probabilities based 
on the reduced-bias estimator against the fitted probabilities based on the maximum 
likelihood estimator. The grey areas are where the points would all be expected to lie 
if the shrinkage relationships were strictly satisfied for each pair of fitted probabilities. 
Clearly this is not the case. 

The points on the plots for the first category roughly lie slightly above the 45° line 
for fitted values less than G(0), and slightly below it for fitted values greater than G(0). 
The points for the last category exhibit similar behaviour but with G(0) replaced by 
1 — G(0). The shrinkage effect appears to be stronger the further the probability is from 
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Figure 5: The fitted probabilities based oj 
fitted probabilities based on the maximum 
the response. The top row corresponds to 
to a proportional hazards model. The grey 
to lie if shrinkage was strict. 
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the shrinkage points G(0) and 1 — G(0). 

The points on the plots for the intermediate categories lie mostly under the 45° line, ex- 
cept in cases where the maximum likelihood fitted probability is very close to zero. Hence, 
the fitted probabilities for the intermediate categories based on the reduced-bias estimator 
tend to shrink towards zero. The plots also suggest that the further the probability is from 
zero the stronger is the shrinkage effect. 

The shrinkage properties observed here are a direct generalization of the shrinkage 
that is implied by improving bias in the estimation of binomial logistic regression models 
(Copas 1988; Cordeiro and McCullagh, 1991 Firth 1992) to links other than the logistic 
and to models with ordinal responses. 

Corresponding empirical investigations of shrinkage based on both complete enumera- 
tions and simulations under models fitted to real data have also been performed but are 
not shown here. The results are qualitatively the same: reduction of bias in cumulative 
link models shrinks the multinomial model towards a binomial model that has probability 
G(0) for the first category and probability 1 — G(0) for the last category. 
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Table 4: Estimated biases, mean-squared errors (MSE) and coverage probabilities of 95% 
Wald-type confidence intervals from a simulation of size 10 5 under the maximum likelihood 
fit of model (16). The last column shows the estimated relative increase of the mean 
squared error from its absolute minimum (the variance) due to bias. The relative increase 
of the mean squared error is the square of the bias divided by the variance. The estimated 
simulation error is less than 0.004 for the bias and the MSE estimates and less than 0.001 
for the coverage estimates. 



Method 


Parameter 


Bias 


MSE 


Coverage 


Bias 2 /Variance (in %) 




Pi 


0.132 


0.142 


0.943 


13.928 




Pi 


0.055 


0.062 


0.943 


5.203 


Maximum likelihood 


P 3 


0.208 


0.722 


0.947 


6.347 




Pi 


0.004 


0.630 


0.944 


0.003 




P 5 


0.077 


0.238 


0.944 


2.569 




Pi 


-0.001 


0.106 


0.948 


0.002 




Pi 


0.001 


0.051 


0.953 


0.001 


Bias correction 


P 3 


-0.004 


0.577 


0.954 


0.002 




Pi 


0.003 


0.551 


0.956 


0.001 




P 5 


0.001 


0.205 


0.954 


0.000 




Pi 


0.002 


0.107 


0.949 


0.002 




Pi 


0.002 


0.051 


0.953 


0.007 


Bias reduction 


P 3 


0.002 


0.579 


0.954 


0.001 




Pi 


0.004 


0.553 


0.956 


0.003 




P 5 


0.003 


0.205 


0.954 


0.003 



10 A simulation study 



In order to further illustrate the properties of the reduced-bias estimator in more complex 
scenarios than the one in the complete enumeration study of Section [8l a simulation study 



is set-up based on part of the data that has been analyzed in Jackman (2004). The data is 



publicly available through the R package pscl (Jackman 



the data available for rater Fl in the analysis in Jackman 



2012) and seems to agree with 



(2004). The data contains the 



score of rater Fl for 106 applications to the Political Science PhD Program at Stanford 
University along with corresponding applicant-specific observations. The rater's score is 
on a five-point integer- valued ordinal scale from 1 to 5, with 1 indicating the lowest rating 
and 5 indicating the highest rating. Consider that the cumulative log-odds for rating s for 
the rth candidate is modelled as 



log 



l-7r 



a. 



-&\X r x-fiiX r i-foZr\-fiiZr2-fo9r (r = 1, . . . , 106; s = 1, . . . , 4) , (16) 



where x r \ and x r 2 are the rth applicant's scores on the quantitative and verbal section of 
the GRE, respectively (after subtracting the respective mean and dividing by the respective 
standard deviation), z r \ and z r 2 are dummy variables indicating whether the rth applicant 
has an interest in American politics and Political Theory, respectively (with 1 representing 
a positive reply and a negative one), and g r is the gender of the rth applicant (r = 
1, . . . , 106). The parameter a\, . . . , as are the cutpoints and . . . , /?5 describe the effect 
of the corresponding applicant-specific covariates on the cumulative log-odds. 



Model ( 16 ) is fitted using maximum likelihood and the maximum likelihood estimates 
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of fa, . . . ,fa are 1.993, 0.892, 2.816, 0.009, 1.215 respectively indicating that an increase 
in the value of any of the covariates is associated with higher probability for high ratings 
holding all else in the model fixed. Then an extensive simulation under the maximum 
likelihood fit is performed for estimating the biases, mean squared errors and coverage 
probabilities of Wald-type 95% confidence intervals for fa, . . . ,fa when maximum likeli- 
hood, bias correction and bias reduction is used. There have been instances of simulated 
data sets where one or more rating categories were empty. In those cases, empty cate- 
gories were merged with neighbouring ones according to the discussion in Remark 1 of 
Section [8} The results are shown on Table [4j There was only one data set for which the 
maximum likelihood estimate of fa was +oo. This data set was excluded when estimating 
the bias, mean squared error and coverage probability for the maximum likelihood and 
the bias-corrected estimator and hence the corresponding figures in the table estimate the 
conditional respective quantities (that is given that the maximum likelihood estimator has 
finite value). On the other hand, the reduced-bias estimates were finite for all datasets 
and hence the corresponding figures are estimates of the targeted unconditional quanti- 
ties. In this particular setting, the probability of the conditioning event is rather small 
and a direct comparison of the estimated conditional and unconditional quantities can be 
informative. 

Temporarily ignoring the fact that the maximum likelihood estimator can be infinite, 
both the bias corrected and reduced bias estimators perform equally well in the current 
study. Furthermore, the figures in Table [4] demonstrate a significant reduction both in 
terms of bias and mean squared error when either bias correction or bias reduction is 
used. In the current study the effect of estimation bias is quite significant; the mean 
squared errors of the components of the maximum likelihood estimator are inflated by as 
much as 13.9% due to bias from their minimum values (the variances). The corresponding 
inflation factors for the bias corrected and reduced-bias estimators are quite close to zero, 
which when combined with the observed reduction in mean squared error illustrates the 
benefits that the reduction of bias can have in the estimation of such models. Lastly, a 
slight improvement in the coverage properties of Wald-type confidence intervals is observed 
when the bias is corrected. 

Overall and taking into account that the reduced-bias estimator is always finite the 
current study illustrates its superior frequentist properties from the alternatives. 



11 Wine tasting data 

The partial proportional odds model of Example |1.1| is here refitted using the reduced-bias 
estimator. The result is shown in Table [H All estimates and estimated standard errors 
are finite. A Wald statistic for testing departures from the assumption of proportional 
odds via departures from the hypothesis fa = fa = fa = fa is 

W = (L~p RB fl(6 RB )Lp RB , 



where 



L 



1 






is a matrix of contrasts of fa The matrix 



1(5) = {lF^(S)L t } 1 , 
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Table 5: The reduced-bias estimates for the parameters of model ([T]), the correspond- 
ing estimated standard errors (in parenthesis) and the values of the Z statistic for the 
hypothesis that the corresponding parameter is zero. 



Parameter RB estimates Z statistic 



Oil 


-1.19 


(0.50) 


-2.40 


a- 2 


1.06 


(0.44) 


2.42 


a 3 


3.50 


(0.74) 


4.73 




5.20 


(1.47) 


3.52 


A 


2.62 


(1.52) 


1.72 


Pi 


2.05 


(0.58) 


3.54 


p3 


2.65 


(0.75) 


3.51 


Pi 


2.96 


(1.50) 


1.98 


e 


1.40 


(0.46) 


3.02 



is the inverse of the variance-covariance matrix of the asymptotic distribution of Lf3 RB , 
where F^^(S) is the /3-block of the inverse of the Fisher information. By the asymptotic 
normality of (3rb, W has an asymptotic x 2 distribution with 3 degrees of freedom. The 
value of W for the data in Table [T] is 0.7502 leading to a p-value of 0.861, which provides 
no evidence against the proportional odds assumption. This is also apparent from Table [5] 
where the reduced-bias estimates of Pi, P2, Ps, /?4 are comparable in value. 

It is worth noting that, in contrast to the output reported in Example |1.1[ the values 
of the Z statistics for Pi and P& are far from being exactly zero. 



12 Concluding remarks and further work 

Based on the results of the complete enumeration study, Pre appears to be always finite in 
contrast to Pml and Pbc-, and also to have comparable behaviour to Pel in terms of bias 
and mean squared error. Furthermore, Wald-type asymptotic confidence intervals based 
on Pub behave satisfactorily, maintaining good coverage properties for a wide range of P 
values. A complete loss of coverage is still present but the point where this happens is 
far away from zero and diverges as the number of observations increases. The application 
of the current complete enumeration setup for complementary log-log and probit link 
functions, for varying values of the row totals, and for different numbers of categories, 
resulted in qualitatively the same conclusions. 

In Remark 1 of the complete enumeration study, the finiteness of the reduced-bias 
estimates for a\ and/or a q was noted even in cases where the first and/or last category 
of the ordinal variable is not observed. This behaviour has been also encountered in the 



simulation study of Section 10 and in all of the many settings where the reduced-bias 
estimator has been applied, and is defensible from an experimental point of view. When 
the experimenter sets an ordinal scale, the end-categories of that scale largely determine 
the possible responses. Hence, one might argue that the end categories should play a bigger 
role than the intermediate categories in the analysis, and a good estimation method should 
not be as democratic as maximum likelihood is in this respect; accepting that the ordinal 
scale is well-defined, if an end category is not observed then it seems more appropriate to 
slightly inflate its probability of occurrence, instead of setting it to zero as the maximum 
likelihood estimator would do. 

The latter point of view does not only apply for non-observed end-categories. It applies 
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to all analyses of ordinal data through cumulative link models and is reinforced by the fact 
that an improvement in the frequentist properties of the maximum likelihood estimator 
resulted in the shrinkage of the cumulative link model towards a binomial model for the 
end-categories. 

The above observations, along with the fact that Srb respects the invariance properties 
of the cumulative link model and can be easily obtained using the procedures in Section [5j 
provide a strong case for its routine use in the estimation of cumulative link models. 



Laara and Matthews ( 1985 ) demonstrate the equivalence of continuation ratio models 
with complementary log-log link and proportional hazards models in discrete time. Hence, 
the reduced-bias estimates for the regression parameters of the former can be obtained by 
using the results in the current paper for the latter. 

The investigation of confidence intervals that maintain good properties without suffer- 
ing from a complete loss of coverage for extreme effects is the subject of future work. Cur- 
rent research focuses on the use of profiles of the asymptotic pivot {U* (8)} T F^ 1 (S)U* (5) 
which can be shown to have an asymptotic x 2 distribution, and the combination of the 
resultant intervals with the profile likelihood intervals. In this way confidence intervals 
with one infinite endpoint are possible and are suggested to accompany the reduced-bias 
estimator which appears always to take finite value. Such intervals seem to better re- 
flect uncertainty when extreme settings are considered, and lead to improved coverage 
properties without loss of coverage. 



As is done in Subsection 1 1 , comparison of nested models can be performed using 
asymptotic Wald-type test based on the reduced-bias estimator. Another option is the 
use of the adjusted score statistic 

where S- are the estimates under the hypothesis that results in the smaller model, and 
U*(S) and F(S) are the vector of adjusted score functions and the Fisher information 
of the larger model, respectively. The fact that U*(6) = U(S) + A(S) where A(6) = 
0(1) guarantees an asymptotic x 2 distribution for that statistic. For the example in 



Subsection [IT] the value of the adjusted score statistic is 0.9357 on 3 degrees of freedom 
giving a p-value of 0.8168 which leads to qualitatively the same conclusion as that of the 
Wald test. When testing departures from the proportional odds assumption in general, 
the adjusted score statistic has the same disadvantage as the ordinary score statistic; the 
Fisher information matrix for the partial proportional odds model can be non-invertible 
when evaluated at the estimates of the corresponding proportional odds model. 
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Supplementary material 



The accompanying supplementary material includes an R script (R Development Core 



Team, 2012) that can be used to produce the results of the complete enumeration study 
for any number of categories, any link function and any configuration of row totals in 
contingency tables. The current version of the R function bpolr is also included. The 
bpolr function fits cumulative link models and their extensions with dispersion effects 
either by maximum likelihood, or bias reduction or bias correction. An updated version 



of the function will be part of the next major release of the R package brglm (Kosmidis 



2007b). Scripts that reproduce the data analyses undertaken in the paper are also available 



in the supplementary material. 
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